Research Article
Machine Learning in Statistical Genetics: A Unified Framework from Representation Learning to Causal Inference 
Computational Molecular Biology, 2026, Vol. 16, No. 5
Received: 16 Jul., 2026 Accepted: 18 Aug., 2026 Published: 03 Sep., 2026
The genetic dissection of complex traits is undergoing a paradigm shift driven by high-dimensional data, nonlinear architectures, and multi-modal integration. Traditional statistical genetics approaches, centered on linear mixed models (LMMs), provide robust frameworks for population structure correction, effect estimation, and causal inference, yet remain limited in capturing higher-order interactions and complex regulatory patterns. In parallel, machine learning and artificial intelligence (ML/AI) methods have demonstrated superior predictive performance through representation learning and nonlinear modeling in large-scale genomic and multi-omics datasets. However, their outputs are largely confined to association-level findings and are often challenged by limited interpretability and portability. Here, we systematically reconceptualize the relationship between statistical genetics and ML/AI by proposing a unified analytical framework. In this framework, ML/AI functions as a representation layer, compressing high-dimensional signals and extracting latent structures, while statistical genetics serves as an inference layer, enabling effect estimation and hypothesis testing. These components are further integrated within a causal inference framework, forming a closed-loop pipeline from prediction to representation, inference, and causality. Within this context, we identify interpretability stability and result portability as key constraints governing ML/AI applications and highlight the role of structural priors—such as regulatory networks and causal graphs—in mitigating spurious associations and facilitating causal discovery. Our analysis demonstrates that the primary value of ML/AI lies not in replacing traditional statistical models, but in enhancing the representation of complex genetic signals, whereas statistical genetics remains indispensable for ensuring inferential validity and reproducibility. 
Future progress will depend on structurally constrained model integration, alongside the development of standardized benchmark datasets and evaluation frameworks. Such advances will enable a transition from association-based analysis toward causal understanding, ultimately unifying predictive performance with biological interpretability.
Statistical genetics has long focused on quantifying the relationship between genetic variation and complex traits, with its methodological foundation largely grounded in linear models and their extensions (Fang and Wu, 2026). Approaches such as linear mixed models (LMM), genomic-relatedness-based restricted maximum likelihood (GREML), and genome-wide association studies (GWAS) provide a coherent and interpretable framework for modeling additive genetic effects, enabling robust inference in population structure correction, heritability estimation, and locus-level association mapping (Grinberg et al., 2017; Schrider and Kern, 2018; Conard et al., 2023). These methods have established the dominant analytical paradigm in both human genetics and crop improvement.
However, as genomic datasets increase in both scale and complexity, the limitations of this framework have become increasingly evident. First, linear assumptions are insufficient to capture nonlinear mechanisms such as dominance, epistasis, and gene-environment (G×E) interactions. Second, the reliance on single-variant hypothesis testing under stringent multiple testing correction reduces power for detecting weak effects and can compromise reproducibility, contributing to the persistence of the “missing heritability” problem (Grinberg et al., 2017; Schrider and Kern, 2018). Together, these challenges suggest that low-complexity models alone may not adequately represent the genetic architecture of complex traits.
In response, machine learning (ML) and artificial intelligence (AI) methods have been increasingly incorporated into statistical genetics. In contrast to traditional approaches centered on parameter estimation and hypothesis testing, ML/AI emphasizes predictive performance, learning complex functions from high-dimensional data. Models such as random forests, support vector machines, and deep neural networks are capable of capturing nonlinear relationships and higher-order interactions within large-scale genotype-phenotype datasets, demonstrating improved performance in trait prediction, disease risk modeling, and multimodal data integration (Schrider and Kern, 2018; Vadapalli et al., 2022; Chaplot et al., 2023). The development of automated machine learning (AutoML) frameworks and scalable computing infrastructures has further lowered barriers to adoption, expanding their applicability across diverse biological settings (Manduchi et al., 2021; Novakovsky et al., 2022).
Despite these advantages, the integration of ML/AI into statistical genetics introduces a fundamental tension between predictive accuracy and interpretability. While high-capacity models often achieve superior predictive performance, their “black-box” nature limits the ability to directly relate model outputs to biologically interpretable mechanisms, thereby challenging transparency, reproducibility, and downstream translation (Murdoch et al., 2019; Azodi et al., 2020; Watson, 2021). More critically, in the presence of linkage disequilibrium (LD) and highly correlated features, ML models primarily capture statistical associations rather than causal effects, implying that high predictive accuracy does not necessarily equate to biological validity (Novakovsky et al., 2022). By contrast, traditional statistical genetics models, although less expressive, offer explicit assumptions and interpretable parameters that more readily align with causal inference frameworks (Murdoch et al., 2019; Watson, 2020).
Recent advances in interpretable and explainable AI (iML/XAI) have sought to address this gap. These approaches include inherently interpretable models (e.g., decision trees and rule-based systems) as well as post hoc explanation techniques such as feature attribution methods (e.g., SHAP, LIME) and attention-based interpretations. While these tools provide insights into model behavior, their interpretations remain dependent on the underlying data distribution and model structure, raising concerns regarding stability and causal validity in highly correlated genomic contexts (Murdoch et al., 2019; Watson, 2021). Thus, interpretability should not be viewed solely as a technical limitation, but rather as a structural tension between predictive modeling objectives and biological inference requirements.
This study formalizes the tension in applying ML/AI to statistical genetics along three interrelated dimensions: the distinction between predictive and inferential objectives, the trade-off between model expressiveness and interpretability constraints, and the relationship between data-driven approaches and structural priors. Within this framework, we examine the scope and limitations of ML/AI methods in modeling complex genetic architectures, analyze their advantages in capturing nonlinear patterns and integrating multimodal data, and highlight their limitations in interpretability and reproducibility. Building on this analysis, we propose a practical framework that integrates structural constraints with stability assessment, aiming to establish an operational balance along the “prediction-interpretation” continuum and to provide methodological guidance for genetic research and molecular breeding practice.
1 A Comparative Framework for ML/AI and Traditional LMM
1.1 LMM as a baseline for structured and interpretable inference
Linear mixed models (LMMs) constitute a foundational methodological framework in statistical genetics for association analysis and heritability estimation. Their central idea is to decompose phenotypic variation into fixed effects and random genetic components, typically modeled via a genomic relationship matrix (GRM) that captures relatedness and population structure (Zhang et al., 2010; Runcie and Crawford, 2018; Jiang et al., 2021). Within this framework, SNP effects can be interpreted as marginal or conditional associations given the genetic background, providing a well-defined statistical estimand.
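For reference, the decomposition described above is commonly written in the following standard form. The notation is assumed here for illustration and is not taken from the original text: y is the phenotype vector, X the fixed-effect design matrix, g the random genetic effect, K the GRM, and σ_g² and σ_e² the genetic and residual variance components. The last line gives the usual variance-ratio definition of SNP heritability, which presumes K is scaled to have unit average diagonal.

```latex
\begin{aligned}
\mathbf{y} &= \mathbf{X}\boldsymbol{\beta} + \mathbf{g} + \boldsymbol{\varepsilon},\\
\mathbf{g} &\sim \mathcal{N}\!\left(\mathbf{0},\, \sigma_g^2 \mathbf{K}\right), \qquad
\boldsymbol{\varepsilon} \sim \mathcal{N}\!\left(\mathbf{0},\, \sigma_e^2 \mathbf{I}\right),\\
h^2 &= \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}.
\end{aligned}
```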
The primary strength of LMMs lies in their transparency and interpretability. Model components and parameters have clear statistical meaning, allowing direct linkage between estimated effects, variance components, and biological interpretation. Furthermore, by explicitly modeling genetic covariance, LMMs effectively control for confounding due to population stratification and relatedness, which is critical for maintaining type I error control in genome-wide association studies (Zhang et al., 2010; Jiang et al., 2021).
However, these strengths are inherently tied to restrictive modeling assumptions. The linear and additive structure limits the ability of LMMs to capture nonlinear genetic architectures, such as epistasis, threshold effects, and gene-environment (G×E) interactions. In high-dimensional settings (p ≫ n), statistical power to detect individual effects diminishes, contributing to the well-known phenomenon of “missing heritability” (Grinberg et al., 2017). Additionally, when linkage disequilibrium (LD) patterns are complex or data exhibit substantial heterogeneity, the covariance structure imposed by the model may deviate from the true genetic architecture, affecting estimation accuracy (Zingaretti et al., 2020).
From a statistical perspective, LMMs can thus be viewed as structured linear estimators that prioritize interpretability and bias control, at the cost of limited flexibility in representing complex genetic effects.
1.2 ML/AI as flexible function approximators for complex genetic structure
In contrast to the parametric modeling paradigm of LMMs, machine learning (ML) and artificial intelligence (AI) methods approach the problem as one of high-dimensional function approximation, learning a mapping directly from genotype to phenotype without imposing explicit structural constraints. The primary objective is typically to minimize prediction error rather than to estimate interpretable parameters.
This paradigm shift confers several advantages. First, ML/AI methods can handle high-dimensional, sparse, and highly correlated genomic data without relying on linearity or normality assumptions, using regularization and feature learning to extract informative patterns (Zingaretti et al., 2020; Elgart et al., 2022; Kelly and McLaughlin, 2024). Second, they are inherently capable of modeling nonlinear and higher-order interactions: tree-based models approximate nonlinear decision boundaries through recursive partitioning, kernel methods map data into higher-dimensional spaces, and deep neural networks capture complex hierarchical interactions, including epistasis and G×E effects. Third, these methods are highly scalable and adaptable, enabling integration of multi-modal data sources such as genomics, environmental variables, imaging, and longitudinal phenotypes within a unified modeling framework (Jones et al., 2023).
These advantages, however, come with important trade-offs. Because ML/AI models do not explicitly encode biological or statistical structure, their outputs often lack direct interpretability. In the presence of LD and correlated predictors, learned patterns may reflect statistical associations rather than biologically meaningful or causal effects (Novakovsky et al., 2022). Moreover, high-capacity models are more sensitive to sample size, hyperparameter tuning, and data quality; in small-sample or high-noise settings, they are prone to overfitting and instability (Grinberg et al., 2017).
Thus, ML/AI methods can be characterized as highly flexible predictive models that reduce bias at the expense of increased variance and reduced interpretability.
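The bias and variance referred to here can be made explicit with the textbook decomposition of expected squared prediction error. This is a general identity under the assumption y = f(x) + ε with noise variance σ², not a result specific to this article, and the symbols are introduced only for illustration.

```latex
\mathbb{E}\!\left[\left(y - \hat{f}(x)\right)^2\right]
= \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2}
+ \underbrace{\operatorname{Var}\!\left[\hat{f}(x)\right]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Read this way, a structured linear estimator accepts a larger first term when the true architecture is nonlinear, while a high-capacity model shrinks the first term and typically enlarges the second unless the sample size is large.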
1.3 A unified criterion: coupling structure, sample size, and objective
Rather than viewing LMM and ML/AI as competing approaches, their relative performance is better understood as context-dependent, governed by the interaction of three key factors: (i) sample size, (ii) underlying signal structure, and (iii) analytical objective (prediction vs. inference).
When sample sizes are large and the underlying genetic architecture departs substantially from linear assumptions—such as in the presence of nonlinear effects, epistasis, or complex interactions—ML/AI methods can achieve superior predictive performance by reducing model bias. This advantage is particularly pronounced in high-dimensional or multi-modal data settings (Elgart et al., 2022; Kelly and McLaughlin, 2024). Conversely, when sample sizes are limited and genetic effects are predominantly additive, LMMs tend to provide more stable and reproducible estimates due to their lower variance and structured modeling assumptions (Zingaretti et al., 2020).
Crucially, the distinction between these approaches also lies in their respective estimands. LMMs aim to estimate interpretable statistical quantities (e.g., SNP effects or variance components), whereas ML/AI methods optimize predictive performance without necessarily yielding interpretable parameters. As a result, LMMs remain preferable in contexts requiring causal interpretation, mechanistic insight, or cross-study comparability, while ML/AI methods are better suited for prediction, ranking, and risk estimation tasks.
In this light, method selection should not be framed as a question of superiority, but rather as a problem of optimally balancing bias, variance, and interpretability given the data structure and research objective. This unified perspective provides a conceptual foundation for integrating ML/AI approaches into broader causal inference frameworks in statistical genetics.
2 Core Methods and Functional Roles: A Unified View from Representation to Structure Learning
2.1 Feature selection as signal extraction under high dimensionality
In high-dimensional genomic settings, feature selection serves a dual role: it enhances predictive stability while translating statistical associations into biologically interpretable candidates. Sparse regression methods, such as LASSO and Elastic Net, perform simultaneous variable selection and shrinkage through penalized optimization, enabling the identification of informative markers within highly correlated genotype matrices shaped by linkage disequilibrium (LD) (Grinberg et al., 2017; Musolf et al., 2021).
Unlike marginal screening approaches, these models capture joint effects across variants, thereby reducing spurious associations and improving generalization. In both crop quantitative traits and human complex diseases, LASSO/Elastic Net can substantially reduce the dimensionality of candidate marker sets while maintaining predictive performance, improving interpretability and computational efficiency (López et al., 2023). In multi-omics contexts, upstream feature selection further mitigates heterogeneity-induced artifacts and alleviates computational burden (Libbrecht and Noble, 2015; Monaco et al., 2021).
From a functional perspective, feature selection can be viewed as the first-stage projection of genotype space, mapping high-dimensional variation into a reduced subset of candidate signals that are amenable to downstream modeling.
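As a rough illustration of sparse selection, the sketch below fits a LASSO by coordinate descent on simulated genotypes. Everything in it is an assumption made for the example: the function name `lasso_cd`, the penalty value, the effect sizes, and the use of independent markers (real genotype matrices carry LD, which makes selection less clean than shown here).

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used by the L1 penalty."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Minimize (1/2n)||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent.

    Assumes the columns of X are standardized and y is centered.
    """
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.copy()                      # current residual y - Xb
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]       # remove marker j from the fit
            rho = X[:, j] @ r / n     # its covariance with the partial residual
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= X[:, j] * b[j]       # add the updated contribution back
    return b

# Hypothetical toy data: 300 individuals, 200 independent SNPs, 3 causal markers.
rng = np.random.default_rng(0)
n, p = 300, 200
G = rng.binomial(2, 0.3, size=(n, p)).astype(float)
X = (G - G.mean(axis=0)) / G.std(axis=0)
causal = [5, 50, 120]
y = X[:, causal] @ np.array([1.0, -0.8, 0.6]) + rng.standard_normal(n)
y = y - y.mean()

beta_hat = lasso_cd(X, y, lam=0.15)
selected = np.flatnonzero(beta_hat)   # indices of markers with nonzero coefficients
print(f"{len(selected)} of {p} markers retained")
```

In practice the penalty would be chosen by cross-validation rather than fixed, and an Elastic Net penalty is often preferred when markers are strongly correlated.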
2.2 Dimensionality reduction as representation learning
When the number of features greatly exceeds the number of samples (p ≫ n), dimensionality reduction becomes essential for stabilizing inference by denoising data and addressing multicollinearity. Principal component analysis (PCA) remains the most widely used linear method, capturing dominant variance components and commonly serving as a tool for population structure correction in GWAS (Grinberg et al., 2017).
However, complex genetic architectures—particularly those involving nonlinear gene regulation or gene-by-environment (G×E) interactions—often require more flexible representations. Deep representation learning approaches, such as autoencoders (AEs), can learn compact nonlinear embeddings that preserve higher-order structure across heterogeneous data modalities (Korfmann et al., 2023). These latent representations provide a unified space in which genetic, environmental, and molecular signals can be jointly encoded.
Conceptually, dimensionality reduction constitutes a representation layer, bridging raw genomic inputs and downstream predictive models.
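A minimal sketch of the linear case is given below: principal components are computed from a standardized genotype matrix by singular value decomposition, on a simulated cohort of two subpopulations. The function name `genotype_pcs`, the allele-frequency model, and all sizes are illustrative assumptions.

```python
import numpy as np

def genotype_pcs(G, k):
    """Return the top-k principal component scores and their variance fractions."""
    Z = G - G.mean(axis=0)
    sd = Z.std(axis=0)
    keep = sd > 0                       # drop monomorphic markers
    Z = Z[:, keep] / sd[keep]
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    pcs = U[:, :k] * S[:k]              # sample scores on the first k components
    var_explained = S ** 2 / (S ** 2).sum()
    return pcs, var_explained[:k]

# Hypothetical cohort: two subpopulations whose allele frequencies have drifted apart.
rng = np.random.default_rng(0)
n_per_pop, m = 100, 1000
freq_a = rng.uniform(0.2, 0.8, size=m)
freq_b = np.clip(freq_a + rng.normal(0, 0.15, size=m), 0.05, 0.95)
G = np.vstack([
    rng.binomial(2, freq_a, size=(n_per_pop, m)),
    rng.binomial(2, freq_b, size=(n_per_pop, m)),
]).astype(float)
labels = np.repeat([0, 1], n_per_pop)

pcs, var_explained = genotype_pcs(G, k=2)
pc1_vs_pop = abs(np.corrcoef(pcs[:, 0], labels)[0, 1])
print(f"|corr(PC1, population label)| = {pc1_vs_pop:.2f}")
```

Scores of this kind are what is typically supplied as covariates for population structure correction; a nonlinear encoder would replace the SVD step while keeping the same role in the pipeline.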
2.3 Modeling nonlinearity and higher-order interactions
Traditional linear mixed models (LMMs) are limited in their ability to capture dominance effects, epistasis, and complex G×E interactions. Machine learning models address this limitation by approximating nonlinear functions over genotype space.
Ensemble methods such as random forests (RF) and gradient boosting machines (GBM) capture nonlinear effects through recursive partitioning, while also providing measures of feature importance (Grinberg et al., 2017). Deep neural networks (DNNs) extend this capacity via hierarchical transformations, enabling the modeling of complex patterns across multiple layers of abstraction (Monaco et al., 2021).
More recently, graph neural networks (GNNs) incorporate biological network structure—such as gene regulatory networks or protein-protein interactions—into the learning process, enabling topology-aware inference (Korfmann et al., 2023). These models extend inference from individual variants to structured biological systems.
Within the unified framework, these approaches correspond to function learning, where the objective is to approximate the mapping from genotype to phenotype as a whole rather than to estimate individual marker effects.
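The limitation of a purely additive model can be seen in a small simulation, sketched below under stated assumptions: two independent SNPs whose effect on the trait is a pure interaction. The coding, effect size, and noise level are hypothetical choices made only to isolate the point.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x1 = rng.binomial(2, 0.5, size=n)
x2 = rng.binomial(2, 0.5, size=n)

# Pure epistasis: the trait depends on the product of centered genotypes only.
y = (x1 - 1) * (x2 - 1) + 0.3 * rng.standard_normal(n)

def r_squared(features, y):
    """In-sample R^2 of an ordinary least-squares fit with intercept."""
    design = np.column_stack([np.ones(len(y)), features])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1.0 - resid.var() / y.var()

r2_additive = r_squared(np.column_stack([x1, x2]), y)
r2_interaction = r_squared(np.column_stack([x1, x2, x1 * x2]), y)
print(f"additive R^2 = {r2_additive:.3f}, with interaction term R^2 = {r2_interaction:.3f}")
```

The additive fit explains almost none of the variance because neither SNP has a marginal effect, whereas any model able to represent the product term recovers most of it. Tree ensembles and neural networks learn such terms implicitly instead of requiring them to be specified.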
2.4 Multi-omics integration as a hierarchical extension
Beyond single-layer modeling, multi-modal learning frameworks integrate genomics, transcriptomics, epigenomics, metabolomics, environmental variables, and phenotypic data to reconstruct the pathway from genetic variation to complex traits (Libbrecht and Noble, 2015).
Deep learning architectures—often combined with attention mechanisms or contrastive learning—can learn shared latent spaces across modalities, improving both predictive performance and mechanistic interpretability. In crop systems, such integration supports cross-environment prediction and stability analysis; in human genetics, it facilitates disease risk modeling and biomarker discovery (Vadapalli et al., 2022).
From a systems perspective, multi-omics integration represents a hierarchical extension of representation learning, enabling the modeling of biological processes across multiple regulatory layers.
2.5 Synthesis: ML/AI as a representation–function learning pipeline
Taken together, the methods described above can be unified into a structured analytical pipeline:
Feature selection reduces dimensionality at the signal level
Representation learning encodes latent structure
Function learning captures nonlinear genotype–phenotype mappings
Multi-omics integration extends inference across biological layers
This pipeline does not directly estimate causal effects; rather, it provides a high-dimensional function approximation framework that complements traditional statistical genetics.
To synthesize these roles, we present a unified representation-layer framework for ML/AI methods (Figure 1).
Figure 1 ML/AI as a representation and prediction layer in statistical genetics. Note: This framework illustrates the role of machine learning and artificial intelligence methods as a functional layer for high-dimensional representation and nonlinear mapping in genetic analyses. Starting from high-dimensional genomic and multi-omics inputs, sparse feature selection (e.g., LASSO, Elastic Net) reduces dimensionality and stabilizes signal extraction. Representation learning (e.g., PCA, autoencoders) constructs low-dimensional embeddings that capture covariance structure. Downstream models, including tree-based methods, deep neural networks, and graph neural networks, learn nonlinear mappings from genotype to phenotype, enabling the detection of complex interactions such as epistasis and gene-environment effects. Multi-omics integration further extends this framework by embedding heterogeneous data into a unified latent space. Importantly, the outputs of ML/AI models primarily reflect predictive associations rather than well-defined causal estimands, distinguishing this layer from statistical inference frameworks such as LMM or causal models.
3 Structural Challenges in Interpretability and Reproducibility
3.1 Interpretability: the gap between predictive attribution and mechanistic explanation
With the increasing adoption of machine learning (ML) and artificial intelligence (AI) in statistical genetics, their advantages in predictive performance—particularly in settings characterized by nonlinear structures and higher-order interactions—have become well established. However, the internal representations learned by these models are typically encoded in high-dimensional parameter spaces or latent variables, which are not readily aligned with interpretable biological mechanisms. This creates a structural gap between predictive accuracy and mechanistic interpretability (Murdoch et al., 2019; Azodi et al., 2020; Van Hilten et al., 2024).
In genetic research, models are expected not only to predict accurately but also to generate biologically testable hypotheses. Consequently, feature importance measures or attribution scores derived from ML models cannot be directly equated with causal explanations. To address this limitation, explainable AI (XAI) methods have been introduced into genetic workflows. For example, SHAP quantifies feature contributions as Shapley values, averaging each feature's marginal contribution over subsets of the other features; LIME fits a simple local surrogate model around an individual prediction; and attention mechanisms provide weighting schemes within deep models that can be inspected (Murdoch et al., 2019; Conard et al., 2023).
Nevertheless, in genomic data characterized by extensive linkage disequilibrium (LD) and multicollinearity, these attribution methods often exhibit instability. Highly correlated variants may share or compete for explanatory importance, leading to variability in feature rankings across datasets or model specifications. As a result, such interpretations are better understood as explanations of model behavior, rather than direct reflections of biological causality (Watson, 2020). From a methodological standpoint, XAI operates at the level of function approximation explanation, rather than data-generating mechanism explanation.
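The instability described above can be probed with a toy bootstrap experiment, sketched here under explicit assumptions: one causal SNP, one non-causal tag SNP in strong LD with it, and absolute marginal correlation used as a stand-in for an attribution score (it is not SHAP or LIME). All names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Causal SNP and a tag SNP in strong LD with it (about 3% of genotypes redrawn).
causal_snp = rng.binomial(2, 0.4, size=n)
tag_snp = causal_snp.copy()
redraw = rng.random(n) < 0.03
tag_snp[redraw] = rng.binomial(2, 0.4, size=int(redraw.sum()))

# Only the causal SNP affects the phenotype.
y = 0.3 * causal_snp + rng.standard_normal(n)
X = np.column_stack([causal_snp, tag_snp])

def attribution(X, y):
    """Stand-in attribution score: absolute marginal correlation with y."""
    return np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

def tag_top_rate(X, y, n_boot=500):
    """Fraction of bootstrap resamples in which the non-causal tag SNP ranks first."""
    wins = 0
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))
        scores = attribution(X[idx], y[idx])
        wins += int(scores[1] > scores[0])
    return wins / n_boot

ld_r = np.corrcoef(causal_snp, tag_snp)[0, 1]
flip_rate = tag_top_rate(X, y)
print(f"LD r = {ld_r:.2f}; tag SNP ranked first in {flip_rate:.0%} of resamples")
```

Because the two variants carry nearly the same information, which one ranks first in a given resample is driven largely by sampling noise, so the ranking describes the fitted model and the sample rather than the causal variant.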
3.2 Reproducibility and portability: systematic sources of data dependence
Beyond interpretability, reproducibility and external portability represent critical challenges for ML/AI applications in genetics. Compared with parametric statistical models, ML methods are inherently more dependent on the data distribution from which they are trained. Learned representations often embed dataset-specific characteristics, making them sensitive to changes in population structure, LD patterns, environmental context, or measurement protocols (Drouin et al., 2018; Kim et al., 2020).
In practice, this manifests as feature selection instability, where variables identified as important vary substantially across data splits or independent cohorts, thereby limiting external validation. Additional sources of variability include sampling bias, batch effects, and differences in preprocessing pipelines. Moreover, insufficient reporting standards and limited code availability further hinder reproducibility and cumulative progress (Pineau et al., 2020).
From a statistical perspective, these issues reflect the limitations of empirical risk minimization (ERM) under distributional shift. Consequently, model evaluation must extend beyond within-sample performance to include cross-cohort and cross-ancestry validation.
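In standard notation (assumed here, not taken from the original), ERM selects parameters by minimizing average loss on the training sample, while the quantity of practical interest is the risk under the target distribution:

```latex
\hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell\!\left(f_{\theta}(x_i),\, y_i\right),
\quad (x_i, y_i) \sim P_{\text{train}};
\qquad
R_{\text{target}}(\theta) = \mathbb{E}_{(x,y) \sim P_{\text{target}}}\!\left[\ell\!\left(f_{\theta}(x),\, y\right)\right]
```

Standard generalization guarantees relate the training objective to the risk under P_train only. When P_target differs from P_train, for example through ancestry, LD structure, or environment, a small training loss gives no direct control over the target risk.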
3.3 Tension between predictive objectives and causal inference
A more fundamental source of tension arises from differences in objective functions. Statistical genetics emphasizes hypothesis-driven inference and causal interpretability, whereas ML methods are primarily optimized for predictive accuracy. As a result, even highly accurate predictive models may fail to identify which genetic factors are necessary or sufficient at the causal level (Murdoch et al., 2019; Watson, 2020).
This issue is particularly pronounced in complex trait analysis, where models may capture stable statistical associations that are driven by LD structure, environmental confounding, or mediated pathways rather than direct causal effects. Therefore, translating predictive associations into biological conclusions without additional validation poses a substantial risk.
To bridge this gap, three complementary strategies have emerged:
Structure-constrained modeling: Incorporating biological priors—such as gene networks, pathways, or causal graphs—into the modeling process to restrict the hypothesis space;
Stability analysis of explanations: Evaluating the consistency of feature importance under perturbations, resampling, or across independent datasets;
Hybrid evaluation frameworks: Assessing models jointly on predictive performance, interpretability, and biological plausibility, rather than optimizing a single criterion.
Within this perspective, ML/AI should not be viewed as a standalone inferential framework, but rather as a high-dimensional function approximation layer at the association level, whose outputs require further validation through downstream methods such as colocalization and causal inference.
4 Benchmarking and Evaluation Framework: From Performance Assessment to Structural Diagnostics
4.1 Benchmark design: from data collections to reference distributions
A central limitation in applying ML/AI methods in statistical genetics lies in the lack of standardized and reusable benchmarking frameworks. Existing studies are often conducted on heterogeneous datasets that differ in sample size, ancestry composition, phenotype definition, and environmental variability, thereby hindering comparability and external validation (Grinberg et al., 2017; Van Hilten et al., 2021).
The key issue is therefore not merely data availability, but the construction of a statistically meaningful benchmarking framework. An effective benchmark should satisfy three core criteria:
(i) Cross-domain coverage: including human population datasets (e.g., UK Biobank), crop multi-environment trial (MET) data, and multi-omics resources (e.g., GTEx);
(ii) Structural diversity: capturing variation in LD patterns, ancestry structure, and gene-environment interactions;
(iii) Controllability: integrating simulated data with known causal structures alongside real-world data (Sheehan and Song, 2015; Collin et al., 2020).
In particular, a dual-track design combining real and simulated data is essential. Real data ensure biological relevance, while simulations enable systematic evaluation of model behavior under controlled conditions, including bias–variance trade-offs and robustness limits (Schrider and Kern, 2018).
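A simulated track with known causal structure might be generated along the following lines. This is a deliberately simple sketch: LD is induced by a first-order autoregressive latent process along the marker order, allele frequencies are fixed at 0.5, and the function name `simulate_cohort` and its parameters are hypothetical. Dedicated population-genetic simulators would be used for realistic benchmarks.

```python
import numpy as np

def simulate_cohort(n, m, n_causal, h2, ld=0.9, seed=0):
    """Simulate genotypes with local LD and a phenotype with known causal SNPs."""
    rng = np.random.default_rng(seed)

    def haplotypes():
        # Latent AR(1) process along the chromosome, thresholded to 0/1 alleles.
        z = np.empty((n, m))
        z[:, 0] = rng.standard_normal(n)
        for j in range(1, m):
            z[:, j] = ld * z[:, j - 1] + np.sqrt(1 - ld ** 2) * rng.standard_normal(n)
        return (z > 0).astype(int)

    G = haplotypes() + haplotypes()          # diploid genotypes coded 0/1/2
    causal = rng.choice(m, size=n_causal, replace=False)
    beta = np.zeros(m)
    beta[causal] = rng.standard_normal(n_causal)

    # Rescale effects so the genetic value has variance h2, then add noise.
    beta *= np.sqrt(h2) / (G @ beta).std()
    g = G @ beta
    g = g - g.mean()
    y = g + np.sqrt(1 - h2) * rng.standard_normal(n)
    return {"G": G, "y": y, "g": g, "causal": causal, "beta": beta}

sim = simulate_cohort(n=500, m=50, n_causal=5, h2=0.5)
adjacent_r = np.corrcoef(sim["G"][:, 0], sim["G"][:, 1])[0, 1]
print(f"causal SNPs: {sorted(sim['causal'].tolist())}, adjacent-SNP r = {adjacent_r:.2f}")
```

Because the causal indices and effects are returned alongside the data, a method's selected features and attributions can be scored directly against ground truth, which is not possible with real cohorts.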
Within a unified framework, benchmark datasets serve as a shared reference distribution, enabling statistically grounded comparisons across methods.
4.2 Multi-layer evaluation: from predictive accuracy to structural stability
Evaluating ML/AI methods in genetics requires moving beyond single performance metrics toward a multi-dimensional evaluation framework that captures prediction, generalization, and interpretability (Azodi et al., 2020; Liang et al., 2020).
At the prediction layer, model performance is quantified using metrics such as R² and RMSE for continuous traits, or AUC for classification tasks. These metrics assess goodness-of-fit but remain confined to the level of statistical association.
At the generalization layer, the focus shifts to performance under distributional shifts, including cross-ancestry, cross-environment, and cross-cohort scenarios. Performance degradation can be quantified as the drop in a given prediction metric between the training distribution and the shifted target distribution, reported in absolute or relative terms.
These measures capture the model’s robustness under out-of-distribution conditions and are critical for assessing real-world applicability.
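The prediction-layer metrics and a simple degradation measure can be sketched in a few lines of plain Python. The `degradation` function is an assumed formulation (absolute or relative drop from in-distribution to out-of-distribution performance), offered for illustration rather than as the article's definition.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot

def auc(labels, scores):
    """Probability that a random case scores above a random control (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def degradation(in_dist, out_dist, relative=False):
    """Drop in a higher-is-better metric when moving to the shifted cohort."""
    delta = in_dist - out_dist
    return delta / in_dist if relative else delta

# Toy usage with made-up predictions.
print(r2([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
print(degradation(0.50, 0.30, relative=True))
```

For error-type metrics such as RMSE, where lower is better, the sign convention of `degradation` would need to be reversed.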
At the interpretation layer, the emphasis is on the stability of feature attribution rather than its magnitude. Stability can be assessed via resampling-based metrics such as Kendall’s τ or top-k overlap (Stab@k), which quantify the consistency of feature rankings across perturbations. These metrics help distinguish robust signals from spurious or unstable explanations.
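Both stability metrics are straightforward to compute from attribution vectors obtained on different resamples. The sketch below assumes one particular reading of Stab@k, namely the mean pairwise overlap |A ∩ B| / k of top-k feature sets. The article does not spell out the formula, so this definition, like the function names, is an assumption.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau-a between two score vectors over the same features."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs

def top_k(scores, k):
    """Indices of the k features with the largest absolute attribution."""
    order = sorted(range(len(scores)), key=lambda i: -abs(scores[i]))
    return set(order[:k])

def stab_at_k(score_vectors, k):
    """Mean pairwise top-k overlap across resampled attribution vectors."""
    tops = [top_k(s, k) for s in score_vectors]
    pairs = list(combinations(tops, 2))
    return sum(len(a & b) / k for a, b in pairs) / len(pairs)

# Toy attribution vectors from two hypothetical resamples over four features.
run_1 = [0.9, 0.8, 0.1, 0.0]
run_2 = [0.9, 0.1, 0.8, 0.0]
print(kendall_tau(run_1, run_2), stab_at_k([run_1, run_2], k=2))
```

Tau-a ignores tied scores in the numerator; when many attributions are exactly zero, as after sparse selection, a tie-corrected variant would be more appropriate.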
Taken together, this three-layer structure reframes model evaluation as a structural diagnostic process: prediction reflects association strength, generalization reflects robustness to distributional variation, and interpretation reflects attribution stability.
4.3 Interpretable reporting: from outputs to auditable evidence
To enhance transparency and reproducibility, ML/AI results should be embedded within a structured reporting framework that explicitly documents data, model, and interpretation components (Van Hilten et al., 2021).
First, data transparency requires reporting sample size, ancestry composition, environmental context, and genotype quality control procedures. Second, model specification should include architecture, hyperparameters, training procedures, and data partitioning strategies to ensure reproducibility. Third, interpretation outputs should provide key features alongside uncertainty measures and stability metrics, supported by biological annotation. Finally, external validation should be treated as a standard requirement, with performance evaluated in independent datasets and accompanied by publicly available code and model documentation.
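One way to make these four reporting components machine-readable is a small structured record that is serialized with each analysis. The schema below is purely hypothetical: the class and field names are illustrative choices that mirror the items listed above, and the values are placeholders rather than results.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class DataCard:
    n_samples: int
    ancestry_composition: dict
    environmental_context: str
    genotype_qc: list

@dataclass
class ModelCard:
    architecture: str
    hyperparameters: dict
    training_procedure: str
    data_partitioning: str

@dataclass
class InterpretationCard:
    key_features: list
    uncertainty: dict
    stability_metrics: dict
    annotation_source: str

@dataclass
class ValidationCard:
    external_datasets: list
    external_performance: dict
    code_location: str

@dataclass
class AnalysisReport:
    data: DataCard
    model: ModelCard
    interpretation: InterpretationCard
    validation: ValidationCard

    def to_json(self):
        """Serialize the full report for archiving alongside code and models."""
        return json.dumps(asdict(self), indent=2)

# Placeholder values only; a real report would be filled in from the analysis.
report = AnalysisReport(
    data=DataCard(0, {}, "<environment description>", ["<QC filter>"]),
    model=ModelCard("<architecture>", {}, "<training procedure>", "<split strategy>"),
    interpretation=InterpretationCard([], {}, {}, "<annotation resource>"),
    validation=ValidationCard([], {}, "<repository location>"),
)
print(report.to_json())
```

Keeping the record alongside the trained model lets a reader check the data, model, interpretation, and validation components without re-running the analysis.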
This framework shifts ML models from predictive tools to auditable inference units, enabling systematic evaluation and reuse.
4.4 Unified evaluation framework (Table 1)
Based on the above considerations, the evaluation of ML/AI methods in statistical genetics can be formalized as a multi-layer structure distinguishing association performance, generalization robustness, and interpretability stability. We summarize this framework as follows:
Table 1 Multi-layer evaluation framework for ML/AI in statistical genetics
The value of this framework lies in three aspects:
(i) clarifying the statistical role of each metric;
(ii) preventing the misinterpretation of predictive performance as causal evidence;
(iii) providing a unified coordinate system for comparing different methods.
5 Discussion: A Unified Perspective from Predictive Representation to Causal Inference
5.1 Fundamental paradigm distinction: separation of inference and prediction objectives
Statistical genetics and machine learning differ fundamentally in their methodological objectives. The former is centered on inferential validity, relying on explicit model assumptions and parameter estimation to quantify genetic effects and test hypotheses. In contrast, the latter prioritizes predictive optimality, focusing on extracting generalizable patterns from high-dimensional, nonlinear, and structured data (Libbrecht and Noble, 2015; Schrider and Kern, 2018; Azodi et al., 2020).
This distinction reflects not a difference in methodological superiority, but rather two complementary representations of genetic signal. Statistical models interpret signal as interpretable structural parameters, whereas machine learning models treat it as learnable functional mappings. Their complementarity thus arises from providing different projections of the same underlying genetic architecture.
In practice, this leads to a natural division of labor: machine learning serves to compress complex signals and prioritize candidates, while statistical genetic frameworks perform effect estimation and causal validation. This “representation–inference” linkage enables predictive outputs to be translated into testable biological hypotheses.
5.2 Structure-dependent method selection: when does ML/AI provide an advantage?
The performance of ML/AI methods depends critically on data structure and signal characteristics, rather than reflecting a universally superior approach. Their advantages are most evident when:
(i) sample size is sufficiently large to support complex model parameterization;
(ii) genetic architecture exhibits substantial nonlinearity or higher-order interactions (e.g., epistasis, G×E);
(iii) data are multi-modal and require integrative modeling;
(iv) the primary objective is prediction or ranking rather than mechanistic interpretation (Vadapalli et al., 2022; Sigala et al., 2023).
Under such conditions, ML/AI models can flexibly approximate complex genotype–phenotype mappings, often achieving superior predictive accuracy. The emergence of AutoML further lowers technical barriers, facilitating broader application in genomic contexts (Manduchi et al., 2021).
However, when sample sizes are limited, genetic effects are predominantly additive, or interpretability and causal inference are primary goals, the advantages of complex models diminish. In these settings, increased model complexity can inflate variance and destabilize interpretation, potentially compromising reliability (Musolf et al., 2021; Novakovsky et al., 2023). Method selection should therefore be viewed as a structure-dependent problem, rather than a fixed methodological preference.
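The variance cost of unnecessary complexity can be made concrete with a small simulation. The sketch below is illustrative only and rests on explicit assumptions: the trait is purely additive, the training sample is small, and ridge regression with all pairwise product features stands in for a more flexible learner. The function names and every numerical setting are hypothetical.

```python
from itertools import combinations
import numpy as np


def simulate_additive(n, p=10, seed=0):
    """Purely additive trait: every SNP has the same small effect (illustrative values)."""
    rng = np.random.default_rng(seed)
    G = rng.binomial(2, 0.4, size=(n, p)).astype(float)
    y = 0.3 * G.sum(axis=1) + rng.normal(size=n)
    return G, y


def with_pairwise(G):
    """Append all pairwise products g_i * g_j as candidate epistasis features."""
    pairs = [G[:, i] * G[:, j] for i, j in combinations(range(G.shape[1]), 2)]
    return np.column_stack([G] + pairs)


def ridge_test_mse(X_tr, y_tr, X_te, y_te, lam=1.0):
    """Fit ridge on standardized training features and return the held-out MSE."""
    mu, sd = X_tr.mean(axis=0), X_tr.std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant columns
    Z_tr, Z_te = (X_tr - mu) / sd, (X_te - mu) / sd
    w = np.linalg.solve(Z_tr.T @ Z_tr + lam * np.eye(Z_tr.shape[1]),
                        Z_tr.T @ (y_tr - y_tr.mean()))
    pred = Z_te @ w + y_tr.mean()
    return float(np.mean((y_te - pred) ** 2))


def compare(n_train=80, n_test=2000, reps=20):
    """Average held-out MSE of the additive and the interaction model over replicates."""
    mse_add, mse_int = [], []
    for r in range(reps):
        G, y = simulate_additive(n_train + n_test, seed=r)
        G_tr, y_tr = G[:n_train], y[:n_train]
        G_te, y_te = G[n_train:], y[n_train:]
        mse_add.append(ridge_test_mse(G_tr, y_tr, G_te, y_te))
        mse_int.append(ridge_test_mse(with_pairwise(G_tr), y_tr,
                                      with_pairwise(G_te), y_te))
    return float(np.mean(mse_add)), float(np.mean(mse_int))


additive_mse, interaction_mse = compare()
print(f"additive model, held-out MSE:    {additive_mse:.3f}")
print(f"interaction model, held-out MSE: {interaction_mse:.3f}")
```

With 10 main effects against 55 features fitted on 80 individuals, the interaction model has no bias advantage to offset its extra variance, which is the situation the paragraph above describes.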
5.3 The interpretability tension: from black box to stable explanation
The predictive gains of deep learning and ensemble methods often come at the cost of interpretability. This trade-off is particularly consequential in genetics, where the goal extends beyond prediction to identifying mechanistic pathways and actionable hypotheses (Azodi et al., 2020; Novakovsky et al., 2023).
Explainable AI (XAI) methods—such as SHAP, LIME, and attention mechanisms—offer tools for interpreting model outputs. However, in genomic data characterized by linkage disequilibrium and multicollinearity, feature attributions are often non-unique and sensitive to data perturbations, leading to instability across resampling or cohort variation (Murdoch et al., 2019; Watson, 2020).
Consequently, interpretability should not be treated merely as a visualization problem, but as a question of stability and reproducibility of explanations. Robust interpretation requires systematic evaluation through resampling, consistency metrics, and external validation, ideally combined with biological constraints such as functional annotations and pathway information.
This reframing elevates interpretation from a descriptive output to a testable statistical object, placing it on equal footing with predictive performance in model evaluation.
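One way to treat explanation stability as a measurable quantity is to recompute attributions on bootstrap resamples and score the agreement of the top-ranked features. The following is a minimal sketch under stated assumptions: genotypes are simulated, linkage disequilibrium is mimicked by making SNPs within a block noisy copies of a lead SNP, and the absolute ridge coefficient serves as a simple stand-in for an attribution score such as SHAP. The function names, the Jaccard-based consistency metric, and all parameter values are illustrative choices rather than a prescribed procedure.

```python
from itertools import combinations
import numpy as np


def simulate(n=400, blocks=5, block_size=5, ld=True, copy_prob=0.95, beta=0.6, seed=0):
    """Genotypes in blocks; with ld=True, SNPs in a block are noisy copies of a lead SNP.

    The lead SNP of each block (its first column) is causal. All values are illustrative.
    """
    rng = np.random.default_rng(seed)
    cols, causal = [], []
    for b in range(blocks):
        lead = rng.binomial(2, 0.3, size=n)
        causal.append(b * block_size)
        cols.append(lead)
        for _ in range(block_size - 1):
            fresh = rng.binomial(2, 0.3, size=n)
            keep = (rng.random(n) < copy_prob) if ld else np.zeros(n, dtype=bool)
            cols.append(np.where(keep, lead, fresh))
    G = np.column_stack(cols).astype(float)
    y = beta * G[:, causal].sum(axis=1) + rng.normal(size=n)
    return G, y


def attribution(G, y, lam=1.0):
    """|ridge coefficient| on standardized genotypes, a stand-in for an XAI score."""
    Z = (G - G.mean(axis=0)) / G.std(axis=0)
    w = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ (y - y.mean()))
    return np.abs(w)


def topk_stability(G, y, k=5, n_boot=30, seed=1):
    """Mean pairwise Jaccard overlap of the top-k attributed SNPs across bootstraps."""
    rng = np.random.default_rng(seed)
    n = len(y)
    tops = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample individuals with replacement
        ranked = np.argsort(-attribution(G[idx], y[idx]))
        tops.append(set(ranked[:k].tolist()))
    overlaps = [len(a & b) / len(a | b) for a, b in combinations(tops, 2)]
    return float(np.mean(overlaps))


for ld in (False, True):
    G, y = simulate(ld=ld)
    print(f"LD={ld}: top-5 Jaccard stability = {topk_stability(G, y):.2f}")
```

A stability score of this kind can be reported next to predictive accuracy; values well below 1 indicate that the identity of the top-ranked variants depends on which individuals happened to be sampled.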
5.4 Toward causal inference: integrating structure and learning
A key limitation of ML/AI in statistical genetics is that its outputs often remain at the level of association or prediction, lacking direct causal interpretability. This limitation stems from the absence of explicit constraints on the underlying data-generating mechanism.
Future progress lies in integrating representation learning with causal inference frameworks, introducing structural constraints into model design. Potential directions include:
(i) incorporating gene regulatory networks or pathway information as structural priors;
(ii) embedding structural equation models or causal graphs to encode dependencies;
(iii) applying counterfactual reasoning and stability analysis to eliminate spurious patterns;
(iv) enforcing causal consistency constraints during model training (Schrider and Kern, 2018; Sigala et al., 2023).
Within this framework, machine learning transitions from a purely predictive tool to a representation layer within a causal inference pipeline, providing structured inputs for downstream causal identification.
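As one concrete reading of a structural prior, a SNP-to-gene annotation can be encoded as a binary mask that restricts which connections a model is allowed to learn. The sketch below is an assumption-laden illustration, not a specific published architecture: the annotation, the two-layer network, the simulated trait, and all names (`make_mask`, `forward`, `train`) are hypothetical, and training uses plain full-batch gradient descent.

```python
import numpy as np


def make_mask(snp_to_gene, n_genes):
    """Binary SNP-by-gene mask: entry (j, g) is 1 only if SNP j is annotated to gene g."""
    mask = np.zeros((len(snp_to_gene), n_genes))
    mask[np.arange(len(snp_to_gene)), snp_to_gene] = 1.0
    return mask


def forward(G, W, v, mask):
    """SNP layer -> gene layer (masked weights, tanh) -> phenotype (linear)."""
    H = np.tanh(G @ (W * mask))
    return H, H @ v


def train(G, y, mask, steps=500, lr=0.05, seed=0):
    """Full-batch gradient descent on squared error; gradients are masked like the weights."""
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.normal(size=mask.shape) * mask
    v = 0.1 * rng.normal(size=mask.shape[1])
    n = len(y)
    losses = []
    for _ in range(steps):
        H, pred = forward(G, W, v, mask)
        err = pred - y
        losses.append(float(0.5 * np.mean(err ** 2)))
        grad_v = H.T @ err / n
        dA = np.outer(err, v) * (1.0 - H ** 2) / n   # backprop through tanh
        grad_W = (G.T @ dA) * mask                   # no gradient outside the annotation
        W -= lr * grad_W
        v -= lr * grad_v
    return W, v, losses


# Hypothetical annotation: 12 SNPs assigned to 3 genes; only genes 0 and 2 affect the trait.
rng = np.random.default_rng(42)
snp_to_gene = np.repeat(np.arange(3), 4)
G = rng.binomial(2, 0.3, size=(500, 12)).astype(float)
G = (G - G.mean(axis=0)) / G.std(axis=0)
y = np.tanh(G[:, :4].sum(axis=1)) - np.tanh(G[:, 8:].sum(axis=1)) + 0.3 * rng.normal(size=500)
y = y - y.mean()

mask = make_mask(snp_to_gene, n_genes=3)
W, v, losses = train(G, y, mask)
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print("off-annotation weights all zero:", bool(np.all(W[mask == 0] == 0)))
```

The mask shrinks the hypothesis space from 36 free SNP-to-gene weights to 12, and each hidden unit corresponds to a named gene, which gives downstream statistical tests a defined object to evaluate.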
At the same time, the development of standardized benchmark datasets, unified evaluation criteria, and transparent reporting practices will be essential for ensuring reproducibility and comparability (Manduchi et al., 2021). Only under such conditions can ML/AI evolve from high-performance predictors into reliable tools for scientific inference.
6 Conclusion: An Integrative Framework from Predictive Representation to Causal Inference
This study provides a methodological synthesis of the relationship between statistical genetics and machine learning/artificial intelligence (ML/AI). The central conclusion is that these approaches are not interchangeable, but rather operate at different layers of information representation within complex genetic systems. Statistical genetics emphasizes testable causal inference and parameter interpretability, whereas ML/AI excels at representation learning and predictive optimization in high-dimensional spaces. Accordingly, their appropriate integration should not be framed as mere complementarity, but as a structured framework in which a prediction-driven representation layer feeds into statistical inference and ultimately supports causal identification.
Within this framework, the primary contribution of ML/AI lies in its ability to compress and reconstruct complex signals. Through nonlinear mappings and multi-modal integration, these methods can extract stable latent structures from high-dimensional data, thereby improving phenotype prediction and candidate prioritization. However, such outputs remain inherently at the level of association. Their biological relevance must be established through statistical genetic approaches that provide formal effect estimation and hypothesis testing. In other words, improved predictive performance does not directly translate into causal knowledge; a rigorous inferential pathway is required to bridge the two.
Two key constraints govern this transition. The first is interpretability stability. In the presence of linkage disequilibrium and high-dimensional collinearity, feature attribution is often non-unique and sensitive to perturbations. Reliable interpretation therefore requires systematic evaluation through resampling consistency, cross-cohort validation, and integration with functional annotations. The second is result portability. Differences in population structure, environmental context, and data generation processes frequently lead to degradation in model performance and interpretability when applied to external datasets. As such, findings derived from ML/AI models must be validated across populations or environments to ensure generalizability.
Advancing ML/AI toward causal inference further depends on the incorporation of structural constraints. Embedding gene regulatory networks, pathway information, or causal graph frameworks into model design can restrict the hypothesis space and reduce reliance on spurious correlations. Progress in this direction will not hinge on isolated methodological improvements, but on establishing a closed-loop system linking representation learning, structural constraints, and causal testing, thereby embedding data-driven models within a scientifically verifiable inference pipeline.
At the practical level, effective integration requires a foundation of standardized infrastructure. Open benchmark datasets, evaluation metrics that jointly capture predictive performance and interpretability stability, and transparent reporting standards are essential for reproducibility and fair comparison across methods. These elements will facilitate the transition of ML/AI from exploratory tools to robust components of statistical genetic analysis.
In summary, the future of statistical genetics lies not in replacing existing paradigms, but in achieving hierarchical integration: leveraging ML/AI for high-dimensional representation, employing statistical models for rigorous inference, and unifying both within a causal framework. Only through such integration can complex trait research simultaneously achieve predictive accuracy, mechanistic insight, and translational relevance, thereby advancing from association-based analysis toward causal understanding.
Author Contributions
Xuanjun Fang conducted this study, including literature review, data analysis, and the writing and revision of the manuscript. The author has read and approved the final version of the manuscript.
Acknowledgements
This work was supported by a Major Project of the National Natural Science Foundation of China (Grant No. 30490254).
References
Azodi C., Tang J., Shiu S., 2020, Opening the Black Box: Interpretable Machine Learning for Geneticists, Trends in Genetics, 36(6): 442-455.
https://doi.org/10.20944/preprints202002.0239.v1
Chaplot N., Pandey D., Kumar Y., Sisodia P.S., 2023, A comprehensive analysis of artificial intelligence techniques for the prediction and prognosis of genetic disorders using various gene disorders, Archives of Computational Methods in Engineering, 30(5): 3301-3323.
https://doi.org/10.1007/s11831-023-09904-1
Collin F., Durif G., Raynal L., Lombaert É., Gautier M., Vitalis R., Marin J.-M., Estoup A., 2020, Extending approximate Bayesian computation with supervised machine learning to infer demographic history from genetic polymorphisms using DIYABC Random Forest, Molecular Ecology Resources, 21(8): 2598-2613.
https://doi.org/10.1111/1755-0998.13413
Conard A.M., DenAdel A., Crawford L., 2023, A spectrum of explainable and interpretable machine learning approaches for genomic studies, Wiley Interdisciplinary Reviews: Computational Statistics, 15(5): e1617.
https://doi.org/10.1002/wics.1617
Drouin A., Letarte G., Raymond F., Marchand M., Corbeil J., Laviolette F., 2019, Interpretable genotype-to-phenotype classifiers with performance guarantees, Scientific Reports, 9(1): 4071.
https://doi.org/10.1038/s41598-019-40561-2
Elgart M., Lyons G., Romero-Brufau S., Kurniansyah N., Brody J., Guo X., Lin H., Raffield L., Gao Y., Chen H., De Vries P., Lloyd-Jones D., Lange L., Peloso G., Fornage M., Rotter J., Rich S., Morrison A., Psaty B., Levy D., Redline S., Sofer T., 2022, Non-linear machine learning models incorporating SNPs and PRS improve polygenic prediction in diverse human populations, Communications Biology, 5(1): 856.
https://doi.org/10.1038/s42003-022-03812-z
Fang X.J., Wu W.R., 2026, Evolution of statistical genetic paradigms: from linkage analysis and candidate gene strategies to GWAS, Molecular Plant Breeding, 24(9): 2817-2829.
Grinberg N.F., Orhobor O.I., King R.D., 2020, An evaluation of machine-learning for predicting phenotype: studies in yeast, rice, and wheat, Machine Learning, 109(2): 251-277.
https://doi.org/10.1007/s10994-019-05848-5
Jiang L., Zheng Z., Fang H., Yang J., 2021, A generalized linear mixed model association tool for biobank-scale data, Nature Genetics, 53(11): 1616-1621.
https://doi.org/10.1038/s41588-021-00954-4
Jones D., Fornarelli R., Derbyshire M., Gibberd M., Barker K., Hane J., 2023, The pursuit of genetic gain in agricultural crops through the application of machine-learning to genomic prediction, Frontiers in Genetics, 14: 1186782.
https://doi.org/10.3389/fgene.2023.1186782
Kelly C., McLaughlin R., 2024, Comparison of machine learning methods for genomic prediction of selected Arabidopsis thaliana traits, PLOS ONE, 19(9): e0308962.
https://doi.org/10.1371/journal.pone.0308962
Kim A., Zaim S., Subbian V., 2020, Assessing reproducibility and veracity across machine learning techniques in biomedicine: A case study using TCGA data, International Journal of Medical Informatics, 141: 104148.
https://doi.org/10.1016/j.ijmedinf.2020.104148
Korfmann K., Gaggiotti O., Fumagalli M., 2023, Deep learning in population genetics, Genome Biology and Evolution, 15: evad008.
https://doi.org/10.1093/gbe/evad008
Liang M., Chang T., An B., Duan X., Du L., Wang X., Miao J., Xu L., Gao X., Zhang L., Li J., Gao H., 2021, A stacking ensemble learning framework for genomic prediction, Frontiers in Genetics, 12: 600040.
https://doi.org/10.3389/fgene.2021.600040
Libbrecht M., Noble W.S., 2015, Machine learning applications in genetics and genomics, Nature Reviews Genetics, 16: 321-332.
https://doi.org/10.1038/nrg3920
López O., González B., López A., Crossa J., 2023, Statistical machine-learning methods for genomic prediction using the SKM library, Genes, 14(5): 1003.
https://doi.org/10.3390/genes14051003
Manduchi E., Romano J., Moore J.H., 2021, The promise of automated machine learning for the genetic analysis of complex traits, Human Genetics, 141: 1529-1544.
https://doi.org/10.1007/s00439-021-02393-x
Monaco A., Pantaleo E., Amoroso N., Lacalamita A., Lo Giudice C., Fonzino A., Fosso B., Picardi E., Tangaro S., Pesole G., Bellotti R., 2021, A primer on machine learning techniques for genomic applications, Computational and Structural Biotechnology Journal, 19: 4345-4359.
https://doi.org/10.1016/j.csbj.2021.07.021
Murdoch J., Singh C., Kumbier K., Abbasi-Asl R., Yu B., 2019, Definitions, methods, and applications in interpretable machine learning, Proceedings of the National Academy of Sciences of the USA, 116: 22071-22080.
https://doi.org/10.1073/pnas.1900654116
Musolf A., Holzinger E., Malley J., Bailey-Wilson J., 2021, What makes a good prediction? Feature importance and beginning to open the black box of machine learning in genetics, Human Genetics, 141: 1515-1528.
https://doi.org/10.1007/s00439-021-02402-z
Novakovsky G., Dexter N., Libbrecht M., Wasserman W., Mostafavi S., 2023, Obtaining genetics insights from deep learning via explainable artificial intelligence, Nature Reviews Genetics, 24(2): 125-137.
https://doi.org/10.1038/s41576-022-00532-2
Pineau J., Vincent-Lamarre P., Sinha K., Larivière V., Beygelzimer A., d’Alché-Buc F., Fox E., Larochelle H., 2021, Improving reproducibility in machine learning research: A report from the NeurIPS 2019 Reproducibility Program, Journal of Machine Learning Research, 22(164): 1-20.
http://jmlr.org/papers/v22/20-028.html
Runcie D., Crawford L., 2018, Fast and flexible linear mixed models for genome-wide genetics, PLOS Genetics, 15(2): e1007978.
https://doi.org/10.1371/journal.pgen.1007978
Schrider D.R., Kern A.D., 2018, Supervised machine learning for population genetics: A new paradigm, Trends in Genetics, 34(4): 301-312.
https://doi.org/10.1016/j.tig.2017.12.005
Sheehan S., Song Y.S., 2016, Deep learning for population genetic inference, PLOS Computational Biology, 12(3): e1004845.
https://doi.org/10.1371/journal.pcbi.1004845
Sigala R., Lagou V., Shmeliov A., Atito S., Kouchaki S., Awais M., Prokopenko I., Mahdi A., Demirkan A., 2023, Machine learning to advance human genome-wide association studies, Genes, 15(1): 34.
https://doi.org/10.3390/genes15010034
Vadapalli S., Abdelhalim H., Zeeshan S., Ahmed Z., 2022, Artificial intelligence and machine learning approaches using gene expression and variant data for personalized medicine, Briefings in Bioinformatics, 23(4): bbac191.
https://doi.org/10.1093/bib/bbac191
Van Hilten A., Katz S., Saccenti E., Niessen W., Roshchupkin G., 2024, Designing interpretable deep learning applications for functional genomics: A quantitative analysis, Briefings in Bioinformatics, 25(5): bbae449.
https://doi.org/10.1093/bib/bbae449
Van Hilten A., Kushner S.A., Kayser M., Ikram M., Adams H., Klaver C., Niessen W., Roshchupkin G., 2021, GenNet framework: Interpretable deep learning for predicting phenotypes from genetic data, Communications Biology, 4: 1735.
https://doi.org/10.1038/s42003-021-02622-z
Watson D., 2020, Conceptual challenges for interpretable machine learning, Synthese, 200: 1-20.
https://doi.org/10.1007/s11229-022-03485-5
Watson D., 2021, Interpretable machine learning for genomics, Human Genetics, 141: 1499-1513.
https://doi.org/10.1007/s00439-021-02387-9
Zhang Z.W., Ersoz E., Lai C.Q., Todhunter R., Tiwari H.K., Gore M.A., Bradbury P.J., Yu J.M., Arnett D.K., Ordovas J.M., Buckler E.S., 2010, Mixed linear model approach adapted for genome-wide association studies, Nature Genetics, 42(4): 355-360.
https://doi.org/10.1038/ng.546
Zingaretti L., Gezan S., Ferrão L., Osorio L., Monfort A., Muñoz P., Whitaker V., Pérez-Enciso M., 2020, Exploring deep learning for complex trait genomic prediction in polyploid outcrossing species, Frontiers in Plant Science, 11: 25.
https://doi.org/10.3389/fpls.2020.00025
